[GPU] New QIPC ops for block #684

Open

hughperkins wants to merge 23 commits into hp/new-qipc-ops-subgroup from hp/new-qipc-ops-block

Conversation

@hughperkins (Collaborator) commented May 10, 2026

Summary

Closes the block-level QIPC gaps listed in perso_hugh/doc/qipc/qipc_gaps_block.md by adding three new families of block-scope primitives under qd.simt.block, stacked on the new subgroup primitives branch (hp/new-qipc-ops-subgroup).

  • block.reduce_{add,min,max} / block.reduce_all_{add,min,max} — CUB-style two-stage warp-reduction tree (per-warp subgroup.reduce_* into shared memory, second pass reduces the per-warp partials). _all variants broadcast the result to every thread via shared memory.
  • block.{inclusive,exclusive}_{add,min,max} — CUB-style block scan via warp scans (subgroup.{inclusive,exclusive}_* per warp, exclusive scan of warp aggregates in shared memory, fold prefix back into per-thread results).
  • block.radix_rank_match_atomic_or — CUB's BlockRadixRankMatchEarlyCounts ATOMIC_OR path: per-warp histogram via atomic_add on shared memory, block-wide exclusive prefix over digit columns, intra-warp match via atomic_or on a per-digit lane mask, leader election via clz/popcnt, leader's atomic_add plus subgroup.shuffle broadcast for the warp offset.

Files

  • python/quadrants/lang/simt/block.py — implementations (~370 LoC)
  • tests/python/test_simt.py — 78 new parametrized tests (36 reduce + 36 scan + 6 radix-rank), all arch=qd.gpu
  • docs/source/user_guide/block.md — user-facing docs with API tables, semantics, costs, and examples for each new primitive

Notes / gotchas resolved during development

  • CUDA cache staleness on first build. After pulling the subgroup branch, the qdcache still held kernels built before the shuffle-up change, which surfaced as `LLVMRuntime function subgroupShuffleUp not found`. `rm -rf ~/.cache/quadrants/qdcache/*` fixes it; not a code bug.
  • SPIR-V FindSMsb vs FindUMsb (commit d3e73bab4). leader = 31 - clz(cast(bin_mask, i32)) produced wrong leaders on Vulkan when bin_mask had bit 31 set (e.g., a single-lane match in lane 31): FindSMsb on a negative i32 returns the position of the highest 0-bit, not the 1-bit, so the leader was elected as lane 1 instead of lane 31 and non-leader lanes read warp_offset 0 from a non-leader lane. Passing u32 directly to clz dispatches FindUMsb and matches CUDA's __clz.
  • SPIR-V subgroupBarrier has no memory semantics. OpControlBarrier(ScopeSubgroup, ScopeSubgroup, MemorySemantics=0) does not publish shared-memory writes, unlike CUDA's __syncwarp. radix_rank pairs every warp-sync inside the match phase with subgroup.mem_fence() (_warp_sync_fence helper) so the lane-mask publish/clear sequence has correct release/acquire ordering on Vulkan / Metal.
  • lane_id = invocation_id() was wrong on wave64 (commit bac3f9f41). On CDNA AMDGPU the hardware wave is 64 lanes wide but log2_warp=5 partitions a block into 32-lane logical warps, so invocation_id() == 0 only fires once per hardware wave instead of once per logical warp — every other logical warp's aggregate was silently dropped. Replaced with tid & (WARP_SIZE-1) (CUB's threadIdx.x & 31 recipe). Wave32 backends are unaffected.

Tests

  • CUDA (RTX, single-GPU rtx-high): 82/82 -k block passing
  • Vulkan (RTX, single-GPU rtx-high): 81 passed + 1 skipped, -k block
  • AMDGPU CDNA (MI300X amdcloud): partial — see follow-up below

AMDGPU CDNA wave64 follow-up (not blocking this PR)

block.reduce / block.scan work on AMDGPU only when the block is exactly one logical warp (block_dim == 32 with log2_warp=5). Multi-warp blocks fail because shuffle_down on AMDGPU's ds_bpermute lowering wraps OOB indices modulo the wave size — i.e. on a wave64 wave, shuffle_down(value, 16) from lane 48 reads lane 0 instead of returning the lane's own value (CUDA's __shfl_down_sync semantics with width=32). subgroup.reduce_* / subgroup.{inclusive,exclusive}_* inherit the same constraint; the parent subgroup branch documents log2_size must equal the hardware wave size on CDNA.

block.radix_rank_match_atomic_or is wave32-only by construction: the bin_mask atomic_or + clz + popcnt leader pattern is keyed on a u32 lane mask. Wave64 would need a u64 bin_mask, u64 clz / popcnt, and 64-lane subgroup.shuffle for the warp_offset broadcast.

Both follow-ups are tracked separately and are not in scope for closing the QIPC block gaps (QIPC currently consumes these on CUDA and Vulkan).

Test plan

  • CUDA: QD_WANTED_ARCHS=cuda pytest tests/python/test_simt.py -k block
  • Vulkan: QD_WANTED_ARCHS=vulkan pytest tests/python/test_simt.py -k block
  • AMDGPU: documented as wave32-only / single-warp follow-up
  • Metal: exercised via CI (tests run with `arch=qd.gpu`)

… reduce)

Ports qipc's `qipc/_src/core/block/reduce.py` into Quadrants as
`qd.simt.block.reduce_{add,min,max}` (lane-0-only) and
`qd.simt.block.reduce_all_{add,min,max}` (broadcast).  Closes one of the
three block-tier rows in `qipc_gaps_block.md`.

Strategy is CUB's `BLOCK_REDUCE_WARP_REDUCTIONS`: per-warp `shuffle_down`
tree reduce -> lane 0 of each warp publishes the warp aggregate into a
`SharedArray(NUM_WARPS, dtype)` -> `block.sync()` -> thread 0
sequentially folds the warp aggregates with the same operator.  The
`reduce_all_*` flavour adds one extra `block.sync()` plus a one-slot
shared-memory broadcast.  When the block is exactly one warp the
shared-memory path is short-circuited at trace time via an
`impl.static(NUM_WARPS == 1)` guard, so single-warp blocks pay only the
per-warp tree.
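
A plain-Python model of this two-stage shape (illustrative only; the
kernel works on registers and shared memory, not lists):

```python
from functools import reduce as fold

def block_reduce_model(values, op, warp=32):
    # stage 1: each warp tree-reduces its own 32 lanes (shuffle_down tree in-kernel)
    partials = [fold(op, values[w:w + warp]) for w in range(0, len(values), warp)]
    # stage 2: thread 0 folds the per-warp partials published to shared memory
    return fold(op, partials)

assert block_reduce_model(list(range(128)), lambda a, b: a + b) == sum(range(128))
```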

Generic `block.reduce(value, tid, block_dim, log2_warp, op, dtype)` and
`block.reduce_all(...)` accept an arbitrary template binary `op` for
custom monoids; the `_add`/`_min`/`_max` shortcuts wire in the binary
ops already exported by `simt/subgroup.py`.

API mirrors qipc: the caller passes `tid` explicitly (Quadrants does
not yet have a portable in-kernel `block.thread_idx()` on CUDA /
AMDGPU; pass `i % block_dim`).  `block_dim` and `log2_warp` are
`template()` so the call lowers to a fully-unrolled IR with no runtime
branches on warp size.
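
A hedged call-site sketch: the kernel scaffolding below (`qd.loop_config`
iteration, array indexing) is an assumption, and the `reduce_add` argument
order is inferred from the generic signature above:

```python
@qd.func
def block_sums(x, out):
    for i in qd.loop_config(block_dim=256):
        tid = i % 256                                    # block-local index
        s = qd.simt.block.reduce_add(x[i], tid, 256, 5)  # block_dim=256, log2_warp=5
        if tid == 0:                                     # lane-0-only result;
            out[i // 256] = s                            # reduce_all_add broadcasts instead
```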

Tests in `tests/python/test_simt.py` parametrise over
`{i32, f32} x {block_dim ∈ [32, 128, 256]}`, covering the single-warp
short-circuit (32) and the multi-warp shared-mem path (128, 256).
Reduce-min/max use a permuted (non-monotone) initialiser so the result
genuinely depends on every lane, not just the first or last.  Run on
`arch=qd.gpu` so CUDA, AMDGPU, Vulkan, and Metal CI all exercise the
ports.

Stacks on `hp/new-qipc-ops-subgroup` (which provides sized
`subgroup.reduce_*`, `_bin_*`, `shuffle_down`, `invocation_id`).

Documentation updated in `docs/source/user_guide/block.md`: support
table grows two rows; new `block.reduce_{add,min,max}` and
`block.reduce_all_{add,min,max}` sections explain the contract, list
the args, and show a worked example.
… scan)

Ports qipc's `qipc/_src/core/block/scan.py` into Quadrants as
`qd.simt.block.inclusive_{add,min,max}` and
`qd.simt.block.exclusive_{add,min,max}`, plus the generic
`block.{inclusive,exclusive}_scan(..., op, [identity,] dtype)` for
custom monoids.  Closes the second of three block-tier rows in
`qipc_gaps_block.md`.

Strategy is CUB's `BLOCK_SCAN_WARP_SCANS`: per-warp Hillis-Steele scan
via `subgroup.{_inclusive_scan, _exclusive_scan}` -> last lane of
every warp publishes the warp aggregate to a
`SharedArray(NUM_WARPS, dtype)` -> `block.sync()` -> every thread
sequentially folds the cross-warp prefix and applies its own warp's
prefix to its scan value.  Cross-warp prefix is computed redundantly
on every thread to avoid a second barrier (same trade-off CUB makes).

Inclusive: warp aggregate at the last lane is just the inclusive
value, written directly.  Exclusive: warp aggregate is recovered as
`op(exclusive[last_lane], value[last_lane])`, since the exclusive
scan does not include the last lane's input.  When the block is
exactly one warp the cross-warp shared-mem path is short-circuited at
trace time.
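
A plain-Python model of the same shape (illustrative only; the kernel's
per-warp scan is a Hillis-Steele shuffle tree, not a sequential loop):

```python
def block_exclusive_scan_model(values, op, identity, warp=32):
    out, cross_prefix = [], identity
    for w in range(0, len(values), warp):
        acc, excl = identity, []
        for v in values[w:w + warp]:             # per-warp exclusive scan
            excl.append(acc)
            acc = op(acc, v)                     # acc ends as op(exclusive[last], value[last])
        out += [op(cross_prefix, e) for e in excl]   # fold this warp's prefix back in
        cross_prefix = op(cross_prefix, acc)     # running exclusive scan of warp aggregates
    return out

vals = [3, 1, 4, 1, 5, 9, 2, 6] * 16             # 128 "threads"
assert block_exclusive_scan_model(vals, lambda a, b: a + b, 0) == [
    sum(vals[:i]) for i in range(len(vals))
]
```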

API mirrors the new block reduce: caller passes `tid` explicitly,
`block_dim` and `log2_warp` are `template()`.  `exclusive_add`
derives the additive identity from `value - value` (matches
`subgroup.exclusive_add`); `exclusive_min` / `exclusive_max` take an
explicit `identity` because no portable type-extreme is derivable
from `value` alone (matches `subgroup.exclusive_min`).

Tests in `tests/python/test_simt.py` parametrise over
`{i32, f32} x {block_dim ∈ [32, 128, 256]}`, asserting per-thread
against a sequential CPU oracle.  Min / max use a permuted
(non-monotone) input so the scan result genuinely depends on every
prefix step, not just the trailing or leading element.  Run on
`arch=qd.gpu` so CUDA, AMDGPU, Vulkan, and Metal CI all exercise the
ports.

Documentation updated in `docs/source/user_guide/block.md`: support
table grows two rows; new `block.inclusive_*` and `block.exclusive_*`
sections explain the contract, identity rules, and the cost profile.
Ports qipc's `qipc/_src/core/sort/_block_radix_rank.py` into Quadrants
as `qd.simt.block.radix_rank_match_atomic_or`.  Closes the third (and
final) row in `qipc_gaps_block.md`.

Faithful port of CUB's
`BlockRadixRankMatchEarlyCounts<WARP_MATCH_ATOMIC_OR>` from
`cub/block/block_radix_rank.cuh` (the SM90 onesweep policy):

  1. Per-warp digit histograms via shared-memory atomic_add.
  2. Per-thread column-sum upsweep across warps (rewrites warp
     histograms into per-warp running prefixes; yields per-thread
     bin_count for digit == tid).
  3. Block exclusive scan on bin_count (uses our newly-added
     `block.exclusive_add`).
  4. Downsweep: fold the block-wide exclusive prefix into every warp's
     offset entry.
  5. Per-key match via shared-memory atomic_or on a per-digit lane
     mask; leader (highest set lane) does a single atomic_add on the
     warp offset and broadcasts via `subgroup.shuffle`; each thread's
     rank = warp_offset + popc(bin_mask & lanemask_le) - 1 (modelled
     in plain Python after this list).
  6. Publish bins + exclusive_digit_prefix to caller-supplied
     SharedArray outparams; one block.sync() retires before exit so
     the caller can read them without an extra barrier.
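
A plain-Python model of step 5's match phase for one warp (illustrative
only; `counters` stands in for the shared-memory warp-offset entries):

```python
def match_phase_model(digits, counters):
    masks = {}
    for lane, d in enumerate(digits):             # atomic_or builds per-digit lane masks
        masks[d] = masks.get(d, 0) | (1 << lane)
    broadcast = {}
    for lane, d in enumerate(digits):
        mask = masks[d]
        if lane == mask.bit_length() - 1:         # leader = highest set lane (31 - clz)
            broadcast[d] = counters[d]            # atomic_add returns the old value...
            counters[d] += bin(mask).count("1")   # ...which subgroup.shuffle broadcasts
    return [
        broadcast[d] + bin(masks[d] & ((1 << (lane + 1)) - 1)).count("1") - 1
        for lane, d in enumerate(digits)          # lanemask_le = (1 << (lane+1)) - 1
    ]

# two colliding digits in a 4-lane toy warp: ranks are stable within each digit
assert match_phase_model([7, 3, 7, 3], {3: 0, 7: 0}) == [0, 0, 1, 1]
```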

Improvements over the qipc port:

  - WARP_SYNC (qipc's `warp.ballot(qd.cast(1, qd.i32))` hack) is
    replaced by `subgroup.sync()` which lowers to `__syncwarp` /
    `OpControlBarrier(ScopeSubgroup, ...)` / `s_barrier` on the
    appropriate backend.
  - CUB's `LaneMaskLe()` PTX intrinsic + qipc's `_lane_mask_le`
    workaround function are replaced by the new portable
    `subgroup.lanemask_le(invocation_id())` (saves ~13 LoC and
    sidesteps the lane==31 overflow special case).
  - `BlockScan::ExclusiveSum` is replaced by the just-added
    `block.exclusive_add`, which drops the qipc kernel's reliance on
    a private sequential cross-warp scan.

API: `radix_rank_match_atomic_or(key, tid, block_dim, log2_warp,
radix_bits, bit_start, num_bits, bins, excl_prefix)` — `key` is one
u32 per thread; `bins` and `excl_prefix` are
`block.SharedArray((1 << radix_bits,), qd.i32)` outparams.  Returns
the per-thread stable rank.  Currently constrained to
`block_dim == 1 << radix_bits`, `log2_warp == 5` (wave32), and
`items_per_thread == 1`; multi-item and wave64 paths are future work.

Tests in `tests/python/test_simt.py` exercise three input
distributions against a CPU oracle:
  - low_entropy: 16 distinct digits each repeated 16 times (heavy
    match-path traffic with several lanes per warp colliding).
  - uniform: full 16-bit uniform random.
  - uniform_high_bits: digit drawn from bits [8, 16) (covers
    bit_start > 0).

The oracle checks bins, excl_prefix, and the per-thread ranks
(uniqueness as a permutation of [0, 256) plus value match).  Run on
`arch=qd.gpu`.

Documentation updated in `docs/source/user_guide/block.md`: support
table grows one row, new section explains the constraints, args,
cost, and shows a worked example.
`subgroup.sync()` lowers to `OpControlBarrier(ScopeSubgroup, ScopeSubgroup, 0)`
on SPIR-V — i.e. the Memory Semantics operand is **0**.  Bare control
barriers without memory semantics do not publish prior shared-memory
writes to other lanes, so on Vulkan / Metal the radix-rank algorithm
silently saw stale `atomic_or` / `atomic_add` results across the warp,
producing off-by-one ranks (e.g. `actual_ranks[14] == 15` vs
`expected[14] == 14` on the uniform_high_bits pattern).

Wrap the WARP_SYNC pattern in `_warp_sync_fence()` which issues both
`subgroup.sync()` and `subgroup.mem_fence()`.  On CUDA, the explicit
mem_fence is a redundant `__threadfence_block` — slight overhead but
always correct.  On SPIR-V, the mem_fence emits a real
`OpMemoryBarrier(ScopeSubgroup, AcquireRelease | UniformMemory |
WorkgroupMemory)` and restores CUB's `__syncwarp` shared-memory
visibility invariant that the algorithm depends on.
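
A hedged sketch of the helper; the names follow this PR's prose, and the
exact module spelling is an assumption:

```python
def _warp_sync_fence():
    subgroup.sync()       # CUDA __syncwarp / SPIR-V OpControlBarrier(Subgroup, Subgroup, 0)
    subgroup.mem_fence()  # SPIR-V OpMemoryBarrier(Subgroup, AcqRel|UniformMem|WorkgroupMem)
```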

Replaces three `subgroup.sync()` call-sites in
`radix_rank_match_atomic_or`: the histogram-zero retire, the post
`atomic_or` retire, and the leader-clear retire.

The longer-term fix is to emit the proper memory semantics on
`subgroupBarrier` in `spirv_codegen.cpp` (matching the pattern used
by `subgroupMemoryBarrier`), but that lives on the subgroup branch /
needs its own runtime rebuild — handle here in user code so this PR
stays self-contained.
The Vulkan / Metal radix-rank failures were not the SPIR-V
subgroupBarrier-without-memory-semantics quirk after all: pairing
sync() with mem_fence() was correct for general robustness but did not
fix the test.

Real bug: ``leader = 31 - clz(cast(bin_mask, i32))`` triggered SPIR-V's
GLSL.std.450 FindSMsb on the i32, which for negative values (top bit
set) returns the index of the most-significant **0**-bit, not the
1-bit.  Concretely, when only the highest lane in a warp has a given
digit, bin_mask = 0x80000000; FindSMsb on (i32)-2147483648 returns 30
(bit 30 is the highest 0-bit), so the leader was elected as lane
``31 - 30 = 1`` instead of lane 31.  All non-leader lanes in the
match group then read warp_offset from lane 1 (which had stale 0
because lane 1 was *not* the actual leader and never atomic_add'd),
producing the observed last-lane-of-warp off-by-one ranks.

Fix: pass the u32 directly to ``clz`` so SPIR-V dispatches FindUMsb
(which returns MSB-of-1 unconditionally and is already cross-backend on
the subgroup branch).  This matches what CUB's ``__clz`` does on CUDA
and what AMDGPU emits via the new amdgcn-clz lowering.
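
A plain-Python model of the two GLSL.std.450 instructions on this exact
bit pattern (runnable; models only the semantics described above):

```python
def find_umsb(x):                  # FindUMsb: index of the highest 1-bit
    return x.bit_length() - 1

def find_smsb(x):                  # FindSMsb: for negatives, highest 0-bit
    return (~x).bit_length() - 1 if x < 0 else x.bit_length() - 1

bin_mask = 0x80000000              # only lane 31 matched
as_i32 = bin_mask - (1 << 32)      # same bits reinterpreted as i32: -2147483648
assert find_umsb(bin_mask) == 31   # u32 path elects lane 31 correctly
assert find_smsb(as_i32) == 30     # i32 path: bit 30 is the highest 0-bit
```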

Restore the ``_warp_sync_fence`` (subgroup.sync + subgroup.mem_fence)
calls in the match phase — they're orthogonal to the clz fix but
remain the right thing for memory visibility across the warp on
SPIR-V (whose ``OpControlBarrier(ScopeSubgroup, ScopeSubgroup, 0)``
has no memory semantics; pairing it with an explicit
``OpMemoryBarrier(ScopeSubgroup, AcqRel | UniformMem | WorkgroupMem)``
restores CUDA ``__syncwarp``'s shared-memory visibility invariant).

CUDA was unaffected because ``__clz`` ignores signedness (counts
leading zeros bit-pattern-wise) — the failure only showed up on the
SPIR-V GLSL extended instruction set.

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d3e73bab45


Comment thread: python/quadrants/lang/simt/block.py (outdated)
WARP_SIZE = impl.static(1 << log2_warp)
NUM_WARPS = impl.static(block_dim // WARP_SIZE)

inclusive = _subgroup_inclusive_scan(value, op, log2_warp)

P2: Preserve operand order for generic scans

For custom block.inclusive_scan/exclusive_scan users with an associative but non-commutative operator, this produces the wrong prefix order: the reused subgroup scan folds the shuffled lower-lane value as the second operand (op(current, previous)) rather than op(previous, current). The docs for these new generic APIs promise op(v[0], ..., v[i]) and "custom monoids", so matrix-multiply-style monoids will return reversed/scrambled prefixes within each warp; the added add/min/max tests don't catch this because those ops are commutative.
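
A plain-Python illustration of the point, using string concatenation as an
associative but non-commutative op:

```python
vals = ["a", "b", "c", "d"]
good, bad = [vals[0]], [vals[0]]
for v in vals[1:]:
    good.append(good[-1] + v)   # op(previous, current): "a", "ab", "abc", "abcd"
    bad.append(v + bad[-1])     # op(current, previous): "a", "ba", "cba", "dcba"
assert good == ["a", "ab", "abc", "abcd"]
assert bad == ["a", "ba", "cba", "dcba"]   # reversed prefixes; add/min/max can't see this
```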


…vocation_id()

On wave32 hardware (CUDA, Metal, RDNA AMDGPU, most Vulkan compute) the
two are equivalent: hardware wave size equals the logical 32-lane warp,
so ``invocation_id()`` runs 0..31 within each warp.

On wave64 hardware (CDNA AMDGPU) the hardware wave is 64 lanes wide
but our log2_warp=5 partitions a block into logical 32-lane warps —
so within one hardware wave we have *two* logical warps (lanes 0..31
and 32..63 of the wave).  ``invocation_id()`` runs 0..63 across that
pair, meaning ``invocation_id() == 0`` only fires once per hardware
wave instead of once per logical warp.  The publish step
``if lane_id == 0: shared[warp_id] = warp_agg`` then skips every odd
logical warp's aggregate, which on a 256-thread block produced
sums missing 4 of 8 warps — exactly the failure mode the local
amdcloud run hit (got 14400 vs expected 32896, the difference being
warps 1,3,5,7).

Replace ``invocation_id()`` with ``tid & (WARP_SIZE-1)`` so the lane
index always means "lane within the logical 32-lane warp".  This
matches CUB's ``threadIdx.x & 31`` recipe on CUDA wave32 and stays
correct when the hardware wave is wider than the logical warp.
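
A plain-Python illustration of the two indexings for a 256-thread block on
wave64 hardware:

```python
WARP_SIZE, WAVE = 32, 64                             # logical warp vs CDNA hardware wave
tids = range(256)
logical_lane = [t & (WARP_SIZE - 1) for t in tids]   # the fix: lane within logical warp
hw_lane = [t % WAVE for t in tids]                   # what invocation_id() returns
assert logical_lane.count(0) == 8                    # all 8 logical warps publish
assert hw_lane.count(0) == 4                         # only warps 0,2,4,6 would publish
```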

Note: this fix alone does *not* make ``block.reduce`` correct on
wave64 with log2_warp=5 — the per-warp ``_warp_reduce`` /
``subgroup.reduce_*`` step still does 5-iteration shuffle_down trees
with offsets 16..1 inside a 64-lane wave, and lane 48's offset-16
shuffle wraps to lane 0 (ds_bpermute address modulo wave bytes) so
warp 1 of every wave gets a contaminated aggregate.  Making block
reduce/scan fully wave64-correct requires either passing log2_warp=6
on CDNA (so the warp size equals the hardware wave) or a width-clip
on shuffle_down — both out of scope for this PR.  Documented in the
test parametrisation; CDNA support is a follow-up.

The radix_rank match phase still uses ``invocation_id()`` because
it deliberately operates on the hardware wave (the bin_mask atomic_or
+ clz + popcnt leader pattern is wave-wide by construction; making it
wave64-correct would need a u64 bin_mask and is also out of scope).

Comment thread: docs/source/user_guide/block.md (outdated)
| `block.SharedArray(shape, dtype)` | yes | yes | yes |
| `block.global_thread_idx()` | yes | yes | — |
| `block.thread_idx()` | no | no | yes |
| `block.reduce_{add,min,max}(v, tid, ...)` | yes | yes | yes |
@hughperkins (Collaborator, Author):

  • why do we need to pass in tid?

Comment thread: docs/source/user_guide/block.md (outdated)
| `block.inclusive_{add,min,max}(v, tid, ...)` | yes | yes | yes |
| `block.exclusive_{add,min,max}(v, tid, ...)` | yes | yes | yes |
| `block.radix_rank_match_atomic_or(...)` | yes | yes | yes |
| `grid.memfence()` (device-scope, see below) | yes | no | no |
@hughperkins (Collaborator, Author):

why is this still in block.md?

Comment thread: docs/source/user_guide/block.md (outdated)

- `value`: per-thread input.
- `tid`: calling thread's block-local index. Pass `i % block_dim` from a `qd.loop_config(block_dim=...)` kernel, or `qd.simt.block.thread_idx()` on backends that expose it.
- `block_dim`: threads per block (compile-time `template()`; must be a multiple of `2**log2_warp`).
@hughperkins (Collaborator, Author):

do we need to pass in block_dim?


@hughperkins force-pushed the hp/new-qipc-ops-block branch from 7f2538b to 009aba4 (May 10, 2026 22:40)
Pulls in the upstream subgroup-branch refresh (block.sync_*_nonzero now
emulated cross-GPU on AMDGPU/Vulkan/Metal, mem_sync renamed to
mem_fence, block.thread_idx ported to CUDA/AMDGPU, the underscore-
prefixed import convention) and merges with the new block reduce / scan
/ radix_rank work on top.

Conflict resolution:

- python/quadrants/lang/simt/block.py: union the two import sets and
  align everything I added on the base branch's underscore convention
  (`func as _func`, `i32 as _i32`, `u32 as _u32`, `ops as _ops`,
  `subgroup as _subgroup`).  All call sites in the new code now go
  through `_ops.atomic_add` / `_ops.cast` / `_ops.clz` /
  `_ops.popcnt` / etc. and `_subgroup.shuffle` /
  `_subgroup.invocation_id` / `_subgroup.lanemask_le` /
  `_subgroup.sync` / `_subgroup.mem_fence`.  No behavioural change.

- docs/source/user_guide/block.md: take the base's wider support table
  (CUDA / AMDGPU / Vulkan / Metal columns) plus the `yes*` cells for
  the now-emulated sync_*_nonzero ops, and add the new rows for
  reduce / scan / radix_rank.  The new rows carry `yes**` for AMDGPU
  with a follow-up footnote explaining the wave32-vs-wave64 log2_warp
  contract; radix_rank specifically is wave32-only and the cell says so.
  Also drop the stale `## Grid-scope fence` heading + body that an
  earlier round had left behind in this file (grid-scope is documented
  in grid.md, not here).

- tests/python/test_simt.py: auto-merged.
@hughperkins force-pushed the hp/new-qipc-ops-block branch from 009aba4 to 351017b (May 10, 2026 22:41)

@hughperkins hughperkins changed the title [feat] Block-level reduce, scan, and radix rank primitives [feat] New QIPC ops for block May 12, 2026
@hughperkins hughperkins changed the title [feat] New QIPC ops for block [GPU] New QIPC ops for block May 12, 2026

… ops

block.reduce*, block.inclusive_*, block.exclusive_*, and
block.radix_rank_match_atomic_or now read the calling thread's block-local
index via block.thread_idx() internally rather than taking it as a parameter.
The change is correctness-neutral and either neutral or a small win
performance-wise: on every backend block.thread_idx() lowers to a single
builtin register read that LLVM / spirv-opt / Metal-compiler CSE reliably,
while non-power-of-two block_dim cases avoid a runtime modulo at the call
site.

Also collapse the redundant _reduce / _reduce_all / _inclusive_block /
_exclusive_block private helpers into the public reduce / reduce_all /
inclusive_scan / exclusive_scan: the prior structure had each public op as
a thin wrapper that only added the now-removed tid parameter, so the
private layer no longer earns its keep.  reduce_add / reduce_min /
reduce_max etc. continue to delegate to the generic reduce / scan ops.
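
A hedged before/after sketch of one call site (argument order is an
assumption; `log2_warp` is still present at this point in the stack and is
removed by the next commit):

```python
# before: caller derives tid and threads it through every call
s = block.reduce_add(v, i % block_dim, block_dim, log2_warp)
# after: the op reads block.thread_idx() internally
s = block.reduce_add(v, block_dim, log2_warp)
```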
…ally

Every block-scope reduction, scan, and the radix-rank primitive previously
took log2_warp as a template parameter, requiring callers to plumb in 5 on
wave32 backends or 6 on wave64.  Now that subgroup.group_size() and
subgroup.log2_group_size() return compile-time Python ints (32 on CUDA /
Metal / Vulkan-on-NVIDIA, 64 on AMDGPU), each block op reads the subgroup
size itself and the parameter is no longer plumbed at the call site.

Adds a compile-time guard (impl.static_assert) at the top of every block op
that block_dim is a positive multiple of subgroup.group_size() -- catches
misconfigurations like block_dim=32 on AMDGPU (wave64) with a clear error
message instead of a silent NUM_WARPS=0 miscompile.

The radix-rank op also gains a compile-time assert that subgroup.group_size()
is 32: the atomic-OR match path is built on 32-lane i32 ballot masks, and
the wave64 path is not yet implemented (parallel to the existing constraint
in the docstring).
The base branch already did this sweep across most user-facing docs and
docstrings; these mentions slipped in after the original cleanup and from
test_atomic.py, which the sweep missed.  No semantic change.
…es Python ints

Inside a @qd.func body, plain Python ops like block_dim % WARP_SIZE get
traced into Quadrants Expressions rather than evaluated as Python ints, so
the new compile-time guards in reduce / scan / radix_rank were failing with
"Static assert with non-static condition" on every backend.

Restoring the impl.static(...) wrappers around the template-int arithmetic
(WARP_SIZE, NUM_WARPS, BLOCK_WARPS, the static_assert condition, SharedArray
shape tuples, and the per-iteration shared-memory indices) forces evaluation
at trace time and the constants reach static_assert / SharedArray as plain
Python ints again, matching the pre-API-cleanup shape.

Pure correctness fix; semantically equivalent to the previous shape.
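
A hedged sketch of the restored pattern (identifier names as they stand at
this point in the stack, before the warp -> subgroup rename):

```python
WARP_SIZE = impl.static(subgroup.group_size())     # plain Python int at trace time
NUM_WARPS = impl.static(block_dim // WARP_SIZE)
impl.static_assert(
    impl.static(block_dim > 0 and block_dim % WARP_SIZE == 0),
    "block.reduce: block_dim must be a positive multiple of subgroup size",
)
partials = block.SharedArray((NUM_WARPS,), dtype)  # shape tuple wants plain ints
```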

…k.py uses it

Mirrors the existing subgroup._inclusive_scan / _exclusive_scan private
helpers in shape -- generic op: template(), log2_size: template(), private
because the generic-op contract is fragile and we don't want to invite
ad-hoc subgroup reductions from arbitrary kernels.  Nothing in subgroup.py
routes through the new helper; the typed reduce_add / reduce_min /
reduce_max keep their existing hard-coded bodies, so no existing IR moves.

block.reduce drops its local _warp_reduce helper and calls
subgroup._reduce instead.  This brings block.py's three per-subgroup steps
into a consistent shape (subgroup._reduce / _inclusive_scan / _exclusive_
scan), removes the duplicated 4-line shuffle tree, and lines up with the
existing convention where block.py only calls into private subgroup
helpers for the generic-op paths.

Identical IR; pure code motion + one call-site rewrite.
block.py is cross-GPU code -- its abstractions (block, subgroup, lane,
shuffle, ballot) come from the SPIR-V / Vulkan / portable subgroup model,
not from CUDA's warp model.  Calling the per-subgroup step a "warp" leaks
CUDA terminology into a portable surface.  This commit sweeps every
cross-GPU identifier and prose mention from "warp" to "subgroup", and
leaves only the genuinely CUDA-specific terms intact (``__syncwarp`` in
comments documenting CUDA's primitive, and the legacy ``qd.simt.warp.*``
test module which is CUDA-only).

block.py identifier renames:
  WARP_SIZE        -> SUBGROUP_SIZE
  WARP_THREADS     -> SUBGROUP_THREADS    (radix-rank local)
  log2_warp        -> log2_subgroup
  NUM_WARPS        -> NUM_SUBGROUPS
  BLOCK_WARPS      -> BLOCK_SUBGROUPS
  warp_id          -> subgroup_id
  warp_idx         -> subgroup_idx
  warp_agg         -> subgroup_agg
  warp_prefix      -> subgroup_prefix
  warp_count       -> subgroup_count
  warp_offset      -> subgroup_offset
  j_warp           -> j_subgroup           (radix-rank loop var)
  _warp_sync_fence -> _subgroup_sync_fence
  CUB stage names: ComputeHistogramsWarp / ComputeOffsetsWarpUpsweep /
  ComputeOffsetsWarpDownsweep -> ...Subgroup / ...SubgroupUpsweep /
  ...SubgroupDownsweep (algorithm-internal names, not exported anywhere).

block.md:
  Deletes the stale ``**`` footnote paragraph that claimed AMDGPU callers
  must pass log2_warp=6 -- the parameter no longer exists; subgroup size
  is read from subgroup.group_size() at compile time, and the same source
  compiles correctly on wave32 and wave64.  Sweeps prose "warp" ->
  "subgroup" in the algorithm descriptions and cost expressions.  Keeps
  "subgroup (warp / wavefront)" in the dual-term glossary line.

test_simt.py:
  Sweeps prose in our block-reduce / block-scan / block-radix-rank
  section headers.  Rewrites the stale ``log2_warp is pinned to 5``
  comment to reflect the current compile-time auto-detection.  Leaves
  the legacy qd.simt.warp.* tests, the segmented-reduce tests, and the
  ``__syncwarp`` mention in test_block_sync untouched.

Pure rename + doc cleanup; identical IR.
@hughperkins force-pushed the hp/new-qipc-ops-block branch from 5a2f552 to 5d1d17e (May 12, 2026 21:00)
@hughperkins force-pushed the hp/new-qipc-ops-block branch from 5d1d17e to f85c12c (May 12, 2026 21:00)
Splits the previously unioned temp-storage into two SharedArrays: smem_offsets
(i32, always) and smem_match (i32 on wave32, i64 on wave64). The match phase
(step 5) is now compile-time-gated by subgroup size: wave32 keeps the existing
32-bit ballot mask + u32 atomic_or / clz / popcnt; wave64 uses a 64-bit ballot
mask + u64 atomic_or (native on AMDGPU LDS) + u64 clz / popcnt, with an inline
u64 lanemask_le (subgroup.lanemask_le is u32-only by contract). Wave32 backends
never see the i64 path, so portability does not depend on SPIR-V / Metal
supporting 64-bit threadgroup atomics.
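
A hedged sketch of the compile-time gate (shared-array shapes and dtype
spellings are assumptions):

```python
if impl.static(subgroup.group_size() == 32):
    # wave32: 32-bit ballot mask + u32 atomic_or / clz / popcnt (existing path)
    smem_match = block.SharedArray((NUM_SUBGROUPS, 1 << radix_bits), qd.i32)
else:
    # wave64: 64-bit ballot mask + u64 atomic_or / clz / popcnt, inline u64 lanemask_le
    smem_match = block.SharedArray((NUM_SUBGROUPS, 1 << radix_bits), qd.i64)
```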
@hughperkins force-pushed the hp/new-qipc-ops-block branch from f85c12c to cd9e546 (May 12, 2026 21:04)

Brings in the AMDGPU RDNA-wave64 cross-half shuffle fix in the subgroup
runtime (permlane64-based), plus log2_size=6 absolute-correctness tests for
subgroup reduce / scan. Unblocks amddesktop validation of the wave64 path
in block.radix_rank_match_atomic_or.
Block reduce / scan / radix-rank impls require ``block_dim`` to be a positive multiple
of the subgroup size (enforced by ``impl.static_assert``).  AMDGPU is wave64 in
Quadrants, so the existing ``block_dim=32`` parametrization is degenerate there: it
fails at compile time with ``block.reduce: block_dim must be a positive multiple of
subgroup size``.

Add ``_skip_if_block_dim_lt_subgroup`` and call it from all 12 block-* dim-parametrized
tests.  CUDA / Vulkan / Metal still cover the full ``[32, 128, 256]`` matrix; AMDGPU
exercises ``[128, 256]``.  Also refresh the section-header comment to reflect the wave64
regime.
Picks up '[subgroup] test: lean parameterization for sized reduce / scan tests'.

…block_dim by arch

Previous parametrization ``block_dim ∈ [32, 128, 256]`` was wave32-shaped: on wave64
``block_dim=32`` was unsupported and got skipped, and the single-subgroup short-circuit
path (BLOCK_SUBGROUPS == 1) was never exercised on AMDGPU because no parameter value
landed at ``block_dim == 64``.

Switch the parameter axis to ``sg_per_block ∈ [1, 4, 8]`` and derive ``block_dim``
inside each test body from a host-side ``_arch_subgroup_size()``.  Each arch now
covers its own canonical regimes:

  wave32 (CUDA / Vulkan-NVIDIA / Metal):  block_dim ∈ [32, 128, 256]
  wave64 (AMDGPU):                        block_dim ∈ [64, 256, 512]

Both arches now exercise the single-subgroup short-circuit + multi-subgroup paths.
The ``_skip_if_block_dim_lt_subgroup`` helper is no longer needed (no degenerate
parameter values reach the kernel) and is removed.
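
A hedged sketch of the new axis (`sg_per_block` values and the
`_arch_subgroup_size()` helper are from this commit; the test body is
illustrative):

```python
import pytest

@pytest.mark.parametrize("sg_per_block", [1, 4, 8])
def test_block_reduce_add(sg_per_block):
    block_dim = sg_per_block * _arch_subgroup_size()  # 32/128/256 wave32, 64/256/512 wave64
    ...
```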
Three assertion-message wrappings updated by black 25.1.0 (the version pinned in
.pre-commit-config.yaml).  Pure formatting; no semantic changes.
…ve64

block.md still said radix_rank requires subgroup_size == 32 and that AMDGPU
(wave64) is unsupported; that's been fixed.  Update the constraints bullet to
reflect the wave32 / wave64 dispatch and refresh the shared-memory footprint
breakdown to call out the wave-size-specific match-mask region (4 KiB i32 on
wave32, 8 KiB i64 on wave64).
